Abstract

In this paper, we attempt to explain the 
emergence of the linguistic diversity that 
exists across the consonant inventories of 
some of the major language families of the 
world through a complex network based 
growth model. There is only a single pa- 
rameter for this model that is meant to 
introduce a small amount of randomness 
in the otherwise preferential attachment 
based growth process. The experiments
with this model parameter indicate that
the choice of consonants among the languages
within a family is far more preferential
than it is across the families. The
implications of this result are twofold:
(a) there is an innate preference of the 
speakers towards acquiring certain linguis- 
tic structures over others and (b) shared 
ancestry propels the stronger preferential 
connection between the languages within 
a family than across them. Furthermore, 
our observations indicate that this parame- 
ter might bear a correlation with the period 
of existence of the language families under 
investigation. 



1 Introduction 



In one of their seminal papers (Hauser et al., 2002),
Noam Chomsky and his co-authors remarked that if a
Martian ever
graced our planet then it would be awe-struck by 
the unique ability of the humans to communicate 
among themselves through the medium of lan- 
guage. However, if our Martian naturalist were 
meticulous then it might also note the surprising 
co-existence of 6700 such mutually unintelligible
languages across the world. To date, terrestrial
scientists have no definitive answer as to why
this linguistic diversity exists (Pinker, 1994).
Previous work in the area of language evolution has
tried to explain the emergence of this diversity
through two different background models. The
first assumes that there is a set of predefined
language configurations and that the movement of a
particular language on this landscape is no more
than a random walk (Tomlin, 1986; Dryer, 1992).
The second line of research attempts to relate
ecological, cultural and demographic parameters to
the linguistic parameters responsible for this
diversity (see, for reference, Arita and Taylor,
1996; Maeda et al., 1997; Kirby, 1998; Livingstone
and Fyfe, 1999; Nettle, 1999; Fought et al., 2004).



From the above studies it turns out that lin- 
guistic diversity is an outcome of the language 
dynamics in terms of its evolution, acquisition 
and change. Like any physical system, the dy- 
namics of a linguistic system can also be viewed 
from three levels (Arhem et al., 2004). On one
extreme, it is a collection of utterances that are
produced and perceived by the speakers of a 
linguistic community; this is analogous to the 
microscopic view of a thermodynamic system. 
On the other extreme, it can be expressed by a set 
of grammar rules and a lexicon; this is analogous 
to the macroscopic view. Sandwiched between 
these two extremes, one can also conceive a 
mesoscopic view of language, where linguistic 
entities such as phonemes, words or letters are 
the basic units and grammar is an emergent 
property of the interactions among these units. 
In recent years, complex networks have
proved to be an extremely suitable framework
for modeling and studying the structure and
dynamics of linguistic systems from a mesoscopic
perspective (see Cancho and Sole, 2001;
Dorogovtsev and Mendes, 2001; Cancho and Sole,
2004; Sole et al., 2005 for references).

In this work, we attempt to investigate the 
diversity that exists across the consonant in- 
ventories of the world's languages through 
an evolutionary framework based on network 
growth. Along the lines of the study presented
in (Choudhury et al., 2006), we model the structure
of the inventories through a bipartite network,
which has two different sets of nodes, one labeled
by the languages and the other by the consonants.
Edges run between these two sets depending on
whether a particular consonant is found in a
particular language. This network is termed the
Phoneme-Language Network or PlaNet in (Choudhury
et al., 2006). We construct five such networks
that respectively represent the consonant
inventories belonging to five major language
families, namely the Indo-European (IE-PlaNet),
the Afro-Asiatic (AA-PlaNet), the Niger-Congo
(NC-PlaNet), the Austronesian (AN-PlaNet) and the
Sino-Tibetan (ST-PlaNet).

The emergence of the distribution of occurrence 
of the consonants across the languages of a fam- 
ily can be explained through a growth model for 
the PlaNet representing the family. We employ the
preferential attachment based growth model
introduced in (Choudhury et al., 2006) and later
analytically solved in (Peruani et al., 2007) to
explain this emergence for each of the five
families. The
model involves a single parameter that is essen- 
tially meant to introduce randomness in the oth- 
erwise predominantly preferential growth process. 
We observe that the families are significantly dif- 
ferent from one another in terms of this parame- 
ter value. We further observe that if we combine 
the inventories for all the families together and 
then attempt to fit this new data with our model, 
the value of the parameter is significantly
different from that of the individual families.
This indicates that the dynamics within the
families are quite different from those across
them. There are possibly two factors that regulate
these dynamics: the innate preference of the
speakers towards acquiring certain linguistic
structures over others and the shared ancestry of
the languages within a family. Finally, we present
a brief evolutionary history of the five families
and point to a possible correlation between their
age and the model parameter.

The rest of the paper is laid out as follows.
Section 2 states the definition of PlaNet, briefly
describes the data source and outlines the
construction procedure for the five networks. In
Section 3 we review the growth model for the
networks. In Section 4, we present the experiments
with the model parameter and the results obtained
for each of the five families, and explain the
significance of each of these results. We conclude
in Section 5 by summarizing our contributions,
pointing out some of the implications of the
current work and indicating possible future
directions.

2 Definition and Construction of the 
Networks 

In this section, we revisit the definition of PlaNet,
briefly discuss the data source, and explain
how we constructed the networks for each of the
families.

2.1 Definition of PlaNet 

PlaNet is a bipartite graph $G = (V_L, V_C, E_{pl})$
consisting of two sets of nodes, namely $V_L$ (labeled
by the languages) and $V_C$ (labeled by the
consonants); $E_{pl}$ is the set of edges running
between $V_L$ and $V_C$. There is an edge
$e \in E_{pl}$ from a node $v_l \in V_L$ to a node
$v_c \in V_C$ iff the consonant $c$ is present in the
inventory of the language $l$. Figure 1
illustrates the nodes and edges of PlaNet.
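
The definition above can be sketched in a few lines of Python. This is only an illustration: the function names `build_planet` and `consonant_degrees` are hypothetical, and the toy inventories are invented for the example rather than taken from UPSID.

```python
from collections import defaultdict


def build_planet(inventories):
    """Build (V_L, V_C, E_pl) from a mapping {language: set of consonants}."""
    v_l = set(inventories)   # language nodes
    v_c = set()              # consonant nodes
    e_pl = set()             # edges (language, consonant)
    for language, consonants in inventories.items():
        for c in consonants:
            v_c.add(c)
            e_pl.add((language, c))
    return v_l, v_c, e_pl


def consonant_degrees(e_pl):
    """Degree of each consonant node = number of languages containing it."""
    degree = defaultdict(int)
    for _, c in e_pl:
        degree[c] += 1
    return dict(degree)


# Hypothetical toy inventories (NOT real UPSID data).
toy = {
    "L1": {"p", "t", "k", "m"},
    "L2": {"p", "t", "n"},
    "L3": {"t", "k", "s"},
}
v_l, v_c, e_pl = build_planet(toy)
degrees = consonant_degrees(e_pl)
```

The degree of a language node is its inventory size, and the degree of a consonant node is the number of languages that use it; the latter is the quantity whose distribution is modeled in Section 3.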

2.2 Data Source 

We use the UCLA Phonological Segment Inventory
Database (UPSID) (Maddieson, 1984) as the source
of data for this work. The choice of this database
is motivated by the large number of typological
studies (Lindblom and Maddieson, 1988; Ladefoged
and Maddieson, 1996; de Boer, 2000; Hinskens and
Weijer, 2003) that earlier researchers have carried
out on it. There are 317 languages in the database,
with 541 consonants found across them. From these
data we manually sort the languages into five
groups representing the five families. Note that
we included a language in a group if and only if
we could find direct evidence of its membership in
the corresponding family. A brief description of
each of these groups and the languages found
within them is given below.

Figure 1: Illustration of the nodes and edges of
PlaNet.

Indo-European: This family includes most of the
major languages of Europe and south, central and
south-west Asia. Currently, it has around 3 billion
native speakers, the largest number among all the
recognized language families of the world. The
total number of languages in this family is 449.
The earliest evidence of the Indo-European
languages dates back 4000 years.

Languages - Albanian, Lithuanian, Breton, Irish, 
German, Norwegian, Greek, Bengali, Hindi- 
Urdu, Kashmiri, Sinhalese, Farsi, Kurdish, Pashto, 
French, Romanian, Spanish, Russian, Bulgarian. 
Afro-Asiatic: Afro-Asiatic languages have about 
200 million native speakers spread over north, 
east, west, central and south-west Africa. This 
family is divided into five subgroups with a total of 
375 languages. The proto-language of this family 
began to diverge into separate branches approxi- 
mately 6000 years ago. 

Languages - Shilha, Margi, Angas, Dera, Hausa, 
Kanakuru, Ngizim, Awiya, Somali, Iraqw, Dizi, 
Kefa, Kullo, Hamer, Arabic, Amharic, Socotri. 
Niger-Congo: The majority of the languages that
belong to this family are found in the sub-Saharan
parts of Africa. The number of native speakers
is around 300 million and the total number of
languages is 1514. This family descends from a
proto-language that dates back 5000 years.
Languages - Diola, Temne, Wolof, Akan, Amo,
Bariba, Beembe, Birom, Cham, Dagbani, Doayo,
Efik, Ga, Gbeya, Igbo, Ik, Koma, Lelemi, Senadi,
Tampulma, Tarok, Teke, Zande, Zulu, Kadugli,
Moro, Bisa, Dan, Bambara, Kpelle.

Network      Language nodes   Consonant nodes   Edges
IE-PlaNet    19               148               534
AA-PlaNet    17               123               453
NC-PlaNet    30               135               692
AN-PlaNet    12               82                221
ST-PlaNet    9                71                201

Table 1: Number of nodes and edges in the five
bipartite networks corresponding to the five families.

Austronesian: The languages of the Austronesian 
family are widely dispersed throughout the islands 
of south-east Asia and the Pacific. There are 1268 
languages in this family, which are spoken by a 
population of 6 million native speakers. The family
separated from its ancestral branch around 4000
years ago.

Languages - Rukai, Tsou, Hawaiian, Iai, Adz- 
era, Kaliai, Roro, Malagasy, Chamorro, Tagalog, 
Batak, Javanese. 

Sino-Tibetan: Most of the languages in this family
are distributed across East Asia. With around
2 billion native speakers, it ranks second after
Indo-European. The total number of languages in
this family is 403. Some of the earliest evidence
of this family can be traced back 6000 years.

Languages - Hakka, Mandarin, Taishan, Jingpho, 
Ao, Karen, Burmese, Lahu, Dafla. 

2.3 Construction of the Networks 

We use the consonant inventories of the languages
listed above to construct the five bipartite
networks: IE-PlaNet, AA-PlaNet, NC-PlaNet,
AN-PlaNet and ST-PlaNet. The number of nodes and
edges in each of these networks is given in
Table 1.

3 The Growth Model for the Networks 

As mentioned earlier, we employ the growth
model introduced in (Choudhury et al., 2006)
and later (approximately) solved
in (Peruani et al., 2007) to explain the emergence
of the degree distribution of the consonant
nodes for the five bipartite networks. For
readability, we briefly summarize the idea below.



Degree Distribution: The degree of a node $v$,
denoted by $k$, is the number of edges incident on
$v$. The degree distribution is the fraction of
nodes $p_k$ that have a degree equal to $k$
(Newman, 2003). The cumulative degree distribution
$P_k$ is the fraction of nodes having degree
greater than or equal to $k$. Therefore, if there
are $N$ nodes in a network then,

$$P_k = \sum_{k'=k}^{N} p_{k'} \qquad (1)$$
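
As an illustration of these two definitions, the following minimal sketch computes the degree distribution and the cumulative degree distribution from a list of node degrees. The function names and the toy degree list are hypothetical and serve only to make the definitions concrete.

```python
from collections import Counter


def degree_distribution(degrees):
    """Return {k: p_k}, the fraction of nodes whose degree equals k."""
    n = len(degrees)
    counts = Counter(degrees)
    return {k: counts[k] / n for k in sorted(counts)}


def cumulative_distribution(p):
    """Return {k: P_k}, where P_k sums p_k' over all observed k' >= k."""
    cumulative = {}
    running = 0.0
    for k in sorted(p, reverse=True):   # accumulate from the largest degree down
        running += p[k]
        cumulative[k] = running
    return dict(sorted(cumulative.items()))


# Hypothetical degrees of six nodes.
toy_degrees = [1, 1, 2, 3, 3, 3]
p = degree_distribution(toy_degrees)
P = cumulative_distribution(p)
```

Only degrees that actually occur appear as keys; for any other value of the degree the fraction of nodes is zero, so the cumulative sum is unaffected.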



Model Description: The model assumes that the
sizes of the consonant inventories (i.e., the
degrees of the language nodes in PlaNet) are known
a priori.

Let the degree of a language node $L_i \in V_L$ be
denoted by $d_i$ (i.e., $d_i$ refers to the inventory
size of the language $L_i$ in UPSID). The consonant
nodes in $V_C$ are assumed to be unlabeled, i.e.,
they are not marked by the articulatory/acoustic
features (see (Trubetzkoy, 1931) for further
reference) that characterize them. The nodes $L_1$
through $L_{317}$ are sorted in ascending order of
their degrees. At each time step a node $L_i$,
chosen in order, preferentially gets connected to
$d_i$ distinct nodes (call each such node $C$) of the
set $V_C$. The probability $\Pr(C)$ with which the
node $L_i$ gets connected to the node $C$ is given by

$$\Pr(C) = \frac{k + \epsilon}{\sum_{\forall C'} (k' + \epsilon)} \qquad (2)$$

where $k$ is the current degree of the node $C$, $C'$
ranges over the nodes in $V_C$ that are not already
connected to $L_i$ (with $k'$ the current degree of
$C'$), and $\epsilon$ is the model parameter that
is meant to introduce a small amount of randomness
into the growth process. The above steps are
repeated until all the language nodes $L_i \in V_L$
get connected to $d_i$ consonant nodes.
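
The growth process described above can be simulated directly. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the function name `grow_planet`, the seed handling and the toy inventory sizes are all hypothetical. Consonants are drawn one at a time without replacement, each with probability proportional to its current degree plus the model parameter, which is one straightforward reading of the rule for choosing distinct nodes.

```python
import random


def grow_planet(inventory_sizes, n_consonants, epsilon, seed=0):
    """Simulate the growth process and return the final consonant degrees."""
    rng = random.Random(seed)
    degree = [0] * n_consonants
    # Language nodes enter in ascending order of their degree (inventory size).
    for d in sorted(inventory_sizes):
        candidates = list(range(n_consonants))   # nodes not yet linked to this language
        chosen = []
        for _ in range(d):
            # Attachment weight of each remaining candidate: k + epsilon.
            weights = [degree[c] + epsilon for c in candidates]
            idx = rng.choices(range(len(candidates)), weights=weights)[0]
            chosen.append(candidates.pop(idx))   # popping keeps the picks distinct
        for c in chosen:
            degree[c] += 1
    return degree


# Hypothetical toy run: three languages, twenty unlabeled consonant nodes.
sizes = [3, 5, 4]
degrees = grow_planet(sizes, n_consonants=20, epsilon=0.05, seed=1)
```

With a small value of the parameter, consonants that already have edges dominate the weights, so later languages mostly reuse them; a large value makes the choice close to uniform.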

Peruani et al. (2007) have shown that after some
simplifications one can exactly solve this model
analytically. Let the average consonant inventory
size be denoted by $\mu$ and the number of consonant
nodes be $N$. The simplified model assumes
that at each time step $t$ a language node gets
attached to $\mu$ consonant nodes, following the
distribution $\Pr(C)$. Under the above assumptions,
the degree distribution $p_{k,t}$ for the consonant
nodes, obtained by solving the model, is a
$\beta$-distribution as follows

$$p_{k,t} = A \left(\frac{k}{t}\right)^{\epsilon - 1} \left(1 - \frac{k}{t}\right)^{\frac{N\epsilon}{\mu} - \epsilon - 1} \qquad (3)$$

where $A$ is a constant term. Using equations 1
and 3 one can easily compute the value of $P_{k,t}$.

There is a subtle point that needs a mention here.
The concept of a time step is very crucial for a
growing network. It might refer to the addition of
an edge or a node to the network. While these two
concepts coincide when every new node has exactly
one edge, there are obvious differences when
the new node has degree greater than one. The
analysis presented in Peruani et al. (2007) holds
good for the case when only one edge is added
per time step. However, if the degree of the new
node being introduced to the system is much less
than $N$, then Eq. 3 is a good approximation of the
emergent degree distribution for the case when a
node with more than one edge is added per time
step. Therefore, the experiments presented in the
next section attempt to fit the degree distribution
of the real networks with Eq. 3 by tuning the
parameter $\epsilon$.
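
To make the analytical expression concrete, the sketch below evaluates a $\beta$-form degree distribution numerically. It rests on two assumptions that should be checked against Peruani et al. (2007): that the distribution has the form $p_{k,t} \propto (k/t)^{\epsilon-1}(1-k/t)^{N\epsilon/\mu-\epsilon-1}$, and that the constant $A$ is fixed by normalising over $k = 1, \ldots, t-1$ (the end points are excluded because the expression can diverge there). The function name is hypothetical; the example values $t = 19$ and $\mu = 534/19$ are the number of languages and the edges-per-language ratio of IE-PlaNet in Table 1.

```python
def beta_degree_distribution(t, n_consonants, mu, epsilon):
    """Return {k: p_{k,t}} for k = 1..t-1, normalised to sum to 1.

    Assumed form: (k/t)^(eps-1) * (1-k/t)^(N*eps/mu - eps - 1).
    """
    a = epsilon - 1.0
    b = n_consonants * epsilon / mu - epsilon - 1.0
    raw = {k: (k / t) ** a * (1.0 - k / t) ** b for k in range(1, t)}
    total = sum(raw.values())          # plays the role of 1/A
    return {k: value / total for k, value in raw.items()}


def cumulative(p):
    """Return {k: P_{k,t}}, summing p over all k' >= k."""
    running, out = 0.0, {}
    for k in sorted(p, reverse=True):
        running += p[k]
        out[k] = running
    return dict(sorted(out.items()))


# Example with the IE-PlaNet sizes from Table 1 and N = 541 consonants.
p_ie = beta_degree_distribution(t=19, n_consonants=541, mu=534 / 19, epsilon=0.055)
P_ie = cumulative(p_ie)
```

For small values of the parameter the first exponent is close to $-1$, so most of the probability mass sits on low degrees, which is the heavy concentration on a few frequent consonants that preferential attachment is meant to capture.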

4 Experiments and Results 

In this section, we attempt to fit the degree
distribution of the five empirical networks with the
expression for $P_{k,t}$ described in the previous
section. For all the experiments we set $N = 541$,
$t$ = number of languages in the family under
investigation and $\mu$ = average degree of the
language nodes of the PlaNet representing the family
under investigation, that is, the average inventory
size for the family. Therefore, given the value of
$k$ we can compute $p_{k,t}$ and consequently
$P_{k,t}$, if $\epsilon$ is known. We vary the value
of $\epsilon$ such that the logarithmic standard
error (LSE) between the degree distribution of the
real network and the equation is least. LSE is
defined as the sum of the squares of the differences
between the logarithms of the ordinate pairs (say
$y$ and $y'$) for which the abscissas are equal. In
other words, LSE $= \sum (\log y - \log y')^2$. The
best fits obtained for each of the five networks
are shown in Figure 2. The values of $\epsilon$ and
the corresponding least LSE for each of them are
noted in Table 2. Note that since we varied
$\epsilon$ in steps of 0.005 during the experiments,
some of the differences in the values of $\epsilon$
in Table 2 are not significant. Nevertheless, there
are certain observations described below that are
statistically significant according to the above
experiments.
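
The fitting procedure can be sketched as a grid search. The code below is an illustration under assumptions, not the authors' code: it assumes the $\beta$-form $p_{k,t} \propto (k/t)^{\epsilon-1}(1-k/t)^{N\epsilon/\mu-\epsilon-1}$ normalised over $k = 1, \ldots, t-1$, and the function names and the search range up to 1.0 are hypothetical. Only the step of 0.005 and the definition of LSE come from the text. The demonstration fits synthetic data generated from the model itself, since no empirical degree distribution is reproduced here.

```python
import math


def model_cumulative(t, n, mu, eps):
    """Cumulative distribution {k: P_{k,t}} for k = 1..t-1 (assumed beta form)."""
    a = eps - 1.0
    b = n * eps / mu - eps - 1.0
    raw = [(k / t) ** a * (1.0 - k / t) ** b for k in range(1, t)]
    total = sum(raw)
    cum, running = {}, 0.0
    for k in range(t - 1, 0, -1):
        running += raw[k - 1] / total
        cum[k] = running
    return cum


def lse(observed, predicted):
    """Sum of squared differences of the logarithms at equal abscissas."""
    return sum(
        (math.log(observed[k]) - math.log(predicted[k])) ** 2
        for k in observed
        if k in predicted and observed[k] > 0
    )


def fit_epsilon(observed, t, n, mu, step=0.005, upper=1.0):
    """Grid search over epsilon; return (best epsilon, least LSE)."""
    best = None
    for i in range(1, int(round(upper / step)) + 1):
        eps = i * step
        err = lse(observed, model_cumulative(t, n, mu, eps))
        if best is None or err < best[1]:
            best = (eps, err)
    return best


# Synthetic check: data generated with epsilon = 0.05 should be recovered.
truth = model_cumulative(t=19, n=541, mu=534 / 19, eps=0.05)
best_eps, best_err = fit_epsilon(truth, t=19, n=541, mu=534 / 19)
```

Because the grid has a resolution of 0.005, two fitted values that differ by a single step cannot be distinguished by this procedure.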

Figure 2: The degree distribution of the different real networks along with the fits obtained from the
equation for the optimal values of $\epsilon$. Black dots indicate plots for the real networks while the grey lines
represent the curves obtained from the equation. For each of the families, the y-axis is in log-scale. The
last figure showing the curve for the Combined-PlaNet is in doubly logarithmic scale.

Network            $\epsilon$ for least LSE   Value of LSE
IE-PlaNet          0.055                      0.16
AA-PlaNet          0.040                      0.24
NC-PlaNet          0.035                      0.19
AN-PlaNet          0.030                      0.17
ST-PlaNet          0.035                      0.03
Combined-PlaNet    0.070                      1.47

Table 2: The values of $\epsilon$ and the least LSE for the
different networks. Combined-PlaNet refers to the
network constructed after mixing all the languages
from all the families. For all the experiments

Observation I: The very low value of the parameter
$\epsilon$ indicates that the choice of consonants
within the languages of a family is strongly
preferential. This innate preference for acquiring
a particular set of consonants tends to grow over
the linguistic generations (Blevins, 2004). In this
context, $\epsilon$ may be thought of as modeling the
(accidental) errors or drifts that can occur during
language transmission. The fact that the values of
$\epsilon$ across the four major language families,
namely Afro-Asiatic, Niger-Congo, Sino-Tibetan and
Austronesian, are comparable suggests that the rate
of error propagation is a universal factor that is
largely constant across the families. The value of
$\epsilon$ for IE-PlaNet is slightly higher than
those of the other four families, which might be an
effect of higher diversification within the family
due to geographical or socio-political factors.
Nevertheless, it is still smaller than the
$\epsilon$ of the Combined-PlaNet.

The optimal $\epsilon$ obtained for Combined-PlaNet is
higher than that of all the families (see Table 2),
though it is comparable to that of the Indo-European
PlaNet. This suggests that the choice
of consonants within the languages of a family is
far more preferential than it is across the families;
this is possibly an outcome of shared ancestry.
In other words, the inventories of genetically
related languages are similar (i.e., they share a lot
of consonants) because they have evolved from the
same parent language through a series of linguistic
changes, and the chance that they use a large
number of the consonants used by the parent language
is naturally high.

Observation II: We observe a very interesting
relationship between the approximate age of the
language family and the values of $\epsilon$ obtained
in each case (see Table 4). The only anomaly is the
Indo-European branch, which possibly indicates that
it might be much older than it is believed to be.
In fact, a recent study (Balter, 2003) has shown
that this family dates back 8000 years.
If this last argument is assumed to be true then the
values of $\epsilon$ increase in step with
the approximate period of existence of the
language families. As a matter of fact, this
correlation can be intuitively justified: the longer
a family has existed, the higher the chances of
its diversification into smaller subgroups, which in
turn increases the chances of transmission errors,
and hence the value of $\epsilon$ comes out higher
for the older families. It should be noted that the
differences between the values of $\epsilon$ for the
language families are not statistically significant.
Therefore, the aforementioned observation should be
interpreted only as an interesting possibility; more
experimentation is required for making any stronger
claim.

4.1 Control Experiment 

How can one be sure that the aforementioned
observations are not an obvious outcome of the
construction of the PlaNet or of some spurious
correlations? To this end, we conduct a control
experiment where a set of inventories is randomly
selected from UPSID to represent a family. The
number of languages chosen is the same as that of
the PlaNets of the various language families. We
observe that the average value of $\epsilon$ for
these randomly constructed PlaNets is 0.068, which,
as one would expect, is close to that of the
Combined-PlaNet. This reinforces the fact that the
inherent proximity among the languages of a real
family is not a chance outcome.
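
The sampling step of this control experiment might look like the sketch below. It is an assumption-laden illustration: the function name is hypothetical, the pool of 317 dummy languages merely stands in for UPSID, and each pseudo-family is drawn independently (so two pseudo-families may share languages), a detail the text does not specify. The family sizes are the language counts of the five PlaNets in Table 1.

```python
import random

# Number of languages per family, as listed in Table 1.
FAMILY_SIZES = {"IE": 19, "AA": 17, "NC": 30, "AN": 12, "ST": 9}


def random_pseudo_families(inventories, sizes, seed=0):
    """Draw one random set of languages per family label, matching its size."""
    rng = random.Random(seed)
    names = sorted(inventories)          # fixed order keeps runs reproducible
    return {
        label: {name: inventories[name] for name in rng.sample(names, size)}
        for label, size in sizes.items()
    }


# Hypothetical stand-in for UPSID: 317 languages with dummy inventories.
pool = {f"lang{i}": {f"c{j}" for j in range(i % 7 + 3)} for i in range(317)}
pseudo = random_pseudo_families(pool, FAMILY_SIZES, seed=42)
```

Each pseudo-family would then be turned into a PlaNet and fitted in the same way as the real families, and the resulting parameter values averaged.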

4.2 Correlation between Families 

Another way to verify the above observation is to
estimate the correlation between the frequency of
occurrence of the consonants for the different
language family pairs (i.e., how the frequencies of
the consonants /p/, /t/, /k/, /m/, /n/ . . . are
correlated across the different families). Table 3
notes the value of this correlation among the five
families. The values in Table 3 indicate that, in
general, the families are only weakly correlated
with each other, the average correlation being as
low as ~ 0.47.
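
A minimal sketch of this computation is given below. The helper names and the two toy families are hypothetical, and taking the frequency vectors over the union of the consonants of the two families is an assumption, since the text does not say which consonant set was used. The last lines simply average the ten pairwise values reported in Table 3.

```python
import math


def consonant_frequencies(inventories, consonants):
    """Number of languages in a family that contain each consonant."""
    return [sum(c in inv for inv in inventories.values()) for c in consonants]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy families (NOT real UPSID data).
family_a = {"A1": {"p", "t", "k"}, "A2": {"p", "t", "m"}, "A3": {"p", "n"}}
family_b = {"B1": {"p", "k", "s"}, "B2": {"t", "k"}, "B3": {"p", "k", "n"}}
consonants = sorted(set().union(*family_a.values(), *family_b.values()))
r = pearson(
    consonant_frequencies(family_a, consonants),
    consonant_frequencies(family_b, consonants),
)

# Average of the ten pairwise correlations reported in Table 3.
table3 = [0.49, 0.48, 0.42, 0.25, 0.66, 0.53, 0.43, 0.55, 0.37, 0.50]
mean_correlation = sum(table3) / len(table3)
```

Pearson's coefficient is unchanged by rescaling, so it makes no difference whether raw counts or relative frequencies are used.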

Note that the correlation between the Afro-Asiatic
and the Niger-Congo families is high not
only because they share the same African origin,
but also due to higher chances of language contact
among their groups of speakers. On the other
hand, the Indo-European and the Sino-Tibetan
families show the least correlation because it is
usually believed that they share absolutely no
genetic connections. Interestingly, similar trends
are observed for the values of the parameter
$\epsilon$. If we combine the languages of the
Afro-Asiatic and the Niger-Congo families and try
to fit the new data then $\epsilon$ turns out to be
0.035, while if we do the same for the Indo-European
and the Sino-Tibetan families then $\epsilon$ is
0.058. For many of the other combinations the value
of $\epsilon$ and the correlation coefficient have a
one-to-one correspondence. However, there are also
clear exceptions. For instance, if we
combine the Afro-Asiatic and the Indo-European
families then the value of $\epsilon$ is very low
(close to 0.04) although the correlation between
them is not very high. The reasons for these
exceptions should be interesting and we plan to
explore this issue further in the future.

Families   IE     AA     NC     AN     ST
IE         -      0.49   0.48   0.42   0.25
AA         0.49   -      0.66   0.53   0.43
NC         0.48   0.66   -      0.55   0.37
AN         0.42   0.53   0.55   -      0.50
ST         0.25   0.43   0.37   0.50   -

Table 3: Pearson's correlation between the
frequency distributions obtained for the family
pairs. IE: Indo-European, AA: Afro-Asiatic,
NC: Niger-Congo, AN: Austronesian, ST: Sino-Tibetan.

Families        Age (in years)   $\epsilon$
Austronesian    4000             0.030
Niger-Congo     5000             0.035
Sino-Tibetan    6000             0.035
Afro-Asiatic    6000             0.040
Indo-European   4000 (or 8000)   0.055

Table 4: The relationship between
the age of a family and the value of $\epsilon$.

5 Conclusion 

In this paper, we presented a method of network 
evolution to capture the emergence of linguistic 
diversity that manifests in the five major language 
families of the world. The bipartite network based 
growth model that we proposed in this paper can 
be associated with the process of language acqui- 
sition by an individual, which largely governs the 
course of language change in a linguistic community.
In the initial years of language development
every child passes through a stage called babbling,
during which he/she learns to produce
non-meaningful sequences of consonants and vowels,
some of which are not even used in the language
to which he/she is exposed (Jakobson, 1968;
Locke, 1983). Clear preferences can be observed
for learning certain sounds such as plosives and
nasals, whereas fricatives and liquids are avoided.
In fact, this hierarchy of preference during the
babbling stage follows the cross-linguistic
frequency distribution of the consonants. This innate
frequency dependent preference towards certain 
phonemes might be because of phonetic reasons 
(i.e., for articulatory/perceptual benefits). In the 
current model, this innate preference gets captured 
through the process of preferential attachment. 

In fact, preferential attachment (PA) is a
universally observed evolutionary mechanism that is
known to shape several physical, biological and
socio-economic systems (Newman, 2003).
This phenomenon has also been invoked
to explain various linguistic phenomena
(Choudhury and Mukherjee, to appear). We
believe that PA also provides a suitable abstraction
for the mechanism of language acquisition.
Acquisition of vocabulary and growth of the mental
lexicon are a few examples of PA in language
acquisition. This work illustrates another variant
of PA applied to explain the structure of consonant
inventories and their diversification across the
language families.